**Assignment 5**

Student name: Kelvin Shen

(#) Overview

In this assignment, I implement several techniques that manipulate images on the manifold of natural images. First, I invert a pre-trained generator to find a latent variable that closely reconstructs a given real image. In the second part, I take a hand-drawn sketch and generate an image that fits the sketch accordingly.

Note: __Two__ Bells and Whistles are completed and are demonstrated at the bottom.

(#) Part 1: Inverting the Generator

In this part, I solve an optimization problem to reconstruct an image from a particular latent code. I use the L-BFGS solver for this non-convex optimization problem.

(##) Ablation on Losses

The loss function is defined as
$$ L(G(z), x) = w_{perc}L_{perc} + w_{L1}L_1 $$
where $L_{perc}$ is the same perceptual loss we used for style transfer in the previous assignment, and $L_1$ is the L1 norm used as the pixel loss. Intuitively, the perceptual loss measures the content distance between two images at a certain layer. In this assignment, we use the conv_5 layer of a pre-trained VGG-19 network to measure this distance after extracting features from both the input and target images. Below I show an ablation over combinations of the two losses, where I fix the weight of the $L_1$ loss at $10$ and vary $w_{perc}$. I use StyleGAN as the model and $z$ as the latent space across all of these experiments.
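The following is a minimal sketch of this inversion loop, assuming a PyTorch generator; the names `generator`, `invert`, and `perceptual_loss`, the latent dimension of 512, and the VGG layer indices are placeholders/assumptions rather than the actual starter-code API.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Feature extractor truncated around conv_5 of VGG-19 (exact index is an assumption).
vgg = models.vgg19(pretrained=True).features[:30].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(pred, target):
    # Content distance measured on conv_5 features of VGG-19.
    return F.mse_loss(vgg(pred), vgg(target))

def invert(generator, target, w_perc=0.01, w_l1=10.0, n_steps=1000):
    # Optimize a latent code z so that G(z) reconstructs the target image.
    z = torch.randn(1, 512, requires_grad=True)            # latent dimension is an assumption
    optimizer = torch.optim.LBFGS([z], max_iter=n_steps)

    def closure():
        optimizer.zero_grad()
        pred = generator(z)
        loss = w_perc * perceptual_loss(pred, target) + w_l1 * F.l1_loss(pred, target)
        loss.backward()
        return loss

    optimizer.step(closure)
    return z.detach()
```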
Target $w_{perc}=0.0001$ $w_{perc}=0.001$ $w_{perc}=0.01$ $w_{perc}=0.1$ $w_{perc}=1$
We can see that $w_{perc} = 0.01$ gives the best reconstruction result.

(##) Ablation on Generative Models

Here I also compare two different generative models: a vanilla GAN vs. StyleGAN. Given the same number of optimization iterations (1000) and the same latent space ($z$), StyleGAN gives much better reconstruction results, mainly because of its style-based generator architecture (which borrows ideas from the style transfer literature) and its use of adaptive instance normalization (AdaIN).
Target Vanilla GAN StyleGAN
(##) Ablation on Latent Space

Finally, I ablate different latent spaces (latent codes in the $z$, $w$, and $w+$ spaces) using StyleGAN. We can see that the $w$ and $w+$ spaces give noticeably better results than $z$. The two are comparable, except that $w+$ looks slightly less crisp and contains less fine detail than $w$. Overall, however, $w+$ reconstructs the shape and colors better than $w$ within the same number of iterations, because it concatenates a separate $w$ vector for each StyleGAN layer that receives input via AdaIN.
Target $z$ $w$ $w+$
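To illustrate what the three parameterizations look like in code, here is a minimal sketch; the `mapping` network name, the latent dimension of 512, and the layer count of 14 are assumptions, not the actual starter-code API.

```python
import torch

latent_dim, num_layers = 512, 14   # assumed StyleGAN dimensions

# z space: optimize the input noise vector directly.
z = torch.randn(1, latent_dim, requires_grad=True)

# w space: optimize a single intermediate latent, shared by all layers.
with torch.no_grad():
    w_init = mapping(torch.randn(1, latent_dim))   # `mapping` is the (assumed) StyleGAN mapping network
w = w_init.clone().requires_grad_(True)

# w+ space: one w vector per synthesis layer, each fed to that layer's AdaIN.
w_plus = w_init.unsqueeze(1).repeat(1, num_layers, 1).clone().requires_grad_(True)
```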
(#) Part 2: Interpolate your Cats

In this part, I interpolate between an image pair by inverting the two images into two latent vectors and combining those vectors linearly, i.e. $z' = \theta z_1 + (1-\theta) z_2$ for $\theta\in[0,1]$, where $z_1=G^{-1}(x_1)$ and $z_2=G^{-1}(x_2)$.
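A minimal sketch of this procedure, assuming the `invert` routine sketched in Part 1 and a pre-trained `generator` (both placeholder names):

```python
import torch

def interpolate(generator, x1, x2, num_steps=8):
    # Invert both endpoint images into latent codes, then blend them linearly.
    z1, z2 = invert(generator, x1), invert(generator, x2)
    frames = []
    for theta in torch.linspace(0.0, 1.0, num_steps):
        z = theta * z1 + (1.0 - theta) * z2
        with torch.no_grad():
            frames.append(generator(z))
    return frames
```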
Source $z$ $w$ $w+$ Target
We can confirm the observation from the previous section that StyleGAN with $w+$ gives the best reconstruction results compared to the other two latent spaces. The transitions are smooth in both shape and color. Interestingly, interpolation in $z$ space does not follow the "shortest" path from one image to the other, because $z$ lacks the capacity to capture the important features.

(#) Part 3: Scribble to Image

In this part, we constrain the image generation with a color scribble. In particular, the scribble comes with a mask, so the $L_1$ loss becomes $\Vert M*G(z) - M*S \Vert_1$, where $M$ is the mask and $S$ is the scribble. Below I show results of image generation subject to scribble constraints, using StyleGAN.
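A minimal sketch of this scribble-constrained objective, reusing the placeholder `generator` from above; `scribble` and `mask` are assumed to be image-sized tensors with the mask equal to 1 on scribbled pixels.

```python
import torch
import torch.nn.functional as F

def scribble_loss(generated, scribble, mask):
    # Penalize the generated image only where the user actually scribbled.
    return F.l1_loss(mask * generated, mask * scribble)

def project_to_scribble(generator, scribble, mask, n_steps=1000):
    # Optimize the latent code so that G(z) matches the scribble on masked pixels.
    z = torch.randn(1, 512, requires_grad=True)            # latent dimension is an assumption
    optimizer = torch.optim.LBFGS([z], max_iter=n_steps)

    def closure():
        optimizer.zero_grad()
        loss = scribble_loss(generator(z), scribble, mask)
        loss.backward()
        return loss

    optimizer.step(closure)
    with torch.no_grad():
        return generator(z.detach())
```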
Scribble $w$ $w+$
With sparser sketches, StyleGAN with either latent space is not able to generate reasonable results, because silhouettes and edges alone do not provide enough constraints to optimize the latent code. For example, the first three sketches provide so few constraints that optimization in the $w$ space hardly changes the latent code at all. Optimization in $w+$ does change the latent code and generates images that satisfy the constraints, but the results contain a lot of empty space and holes. On the other hand, given denser sketches, as shown in the remaining examples, StyleGAN gives more reasonable results with both latent spaces. $w+$ is much better than $w$ because having a different $w$ for each layer makes it easier to fit the input scribble.

(#) Bells & Whistles -- Stable Diffusion

In this part, I implement [Stable Diffusion](https://arxiv.org/abs/2112.10752) given the skeleton file. Results are shown below, followed by a rough code sketch of a comparable sketch-to-image pipeline.
Sketch & Prompt Result 1 Result 2
A cute cat, fantasy art drawn by disney concept artists
A man thinking about his homework
A rooster in a farm, photorealistic
A fantasy landscape with a river in the middle
A human character fights a monster with lightsaber in a video game
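For reference, here is a rough sketch of how a comparable sketch-conditioned generation could be run with the Hugging Face `diffusers` img2img pipeline. This is not the course skeleton code; the model ID and the `strength`/`guidance_scale` values are assumptions.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Load a pre-trained Stable Diffusion checkpoint (model ID is an assumption).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The hand-drawn sketch serves as the initial image; `strength` controls how
# far the diffusion process may move away from it.
sketch = Image.open("sketch.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="A cute cat, fantasy art drawn by disney concept artists",
    image=sketch,
    strength=0.75,
    guidance_scale=7.5,
).images[0]
result.save("result.png")
```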
_Reference: The first sketch is provided on the assignment website, whereas the remaining sketches are all from [reddit.com/r/StableDiffusion](https://www.reddit.com/r/StableDiffusion/)._

(#) Bells & Whistles -- High-Res Experiments

I also experiment with latent code projection on high-resolution data and pre-trained weights. I show results on 128x128 and 256x256 data below; the observation is the same as in Part 1: the latent code in $w$ space reconstructs more detail but is less accurate in overall shape and color than $w+$.
Target (128x128) $w$ $w+$
Target (256x256) $w$ $w+$